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Chapter 1 


Hashing 


1.1 Introduction 


Imagine we have a set of elements that we want to store and lookup via a particular key. This 
key could be of whatever type we choose — integer and string being two useful keys that come 
readily to mind. If we had a table (i.e. an array) of size tablesize, and had exactly tablesize 
values, it would be possible to store each value in a particular cell of that table. Then, if knew 
where a particular value was located, we could access it in constant time. 

Since in this example, we know exactly how many elements we have, it is possible to create 
a function that maps the key values to table indices. Our collection of tablesize elements each 
has a unique key, and our table has tablesize indices, so this function would need to generate a 
mapping that was one-to-one: each key maps to a unique index, and each index is generated by 
a unique key. If we had such a function, then we wouldn’t even need to know in advance where 
each record was. We could simply look up by key, and use our function to obtain an index given 
a key, and that index would be the location of the record containing this key. 

Now, consider a more general situation. In this situation, we are still going to insert records 
into this table based on key, and we will still want to retrieve a record based on that particular 
record’s key, and we would still like to do this in constant time. However, in this more general 
situation, we don’t know in advance how many records there will be, and we don’t know in 
advance what the particular keys of these records will be — and therefore don’t know in advance 
the keys we will be looking up. We only know the data type of the set of possible keys. What 
can we do now? 

Well, we can attempt to use the same strategy, or at least a similar one. First, create a 
function that maps keys to table indices. Then, when we want to insert or lookup a particular 
record, we again choose the appropriate index based on the key of this record. Regardless of 
the particular keys we end up having in whatever records we manipulate, those keys will get 
mapped to certain table indices and those are the particular table indices we use. 

This procedure — generating a table index from a key and inserting or searching for the 
record that contains that key at that index in the table — is known as hashing, and the fuction 
that converts from key to index is known as a hash function. A hash function takes as input a 
value K from a key space — a large set of values from which a limited set of particular keys will 
be chosen — and from that key, generates a legal index of a hash table. If a particular key, when 
passed into the hash function, generates a particular index, we say that the key hashes to that 
index. 
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There is a problem with this method — if we know our keys in advance, and have exactly as 
many records as the tablesize, we can make our function one-to-one...but in general we don’t 
know how many records we will have, and our keys can be any keys in the key space, so there 
is no way to make this hash function one-to-one. As a result, it is entirely possible that two of 
the particular keys we are using will hash to the same index. This situation, where two distinct 
keys hash to the same index, is known as a collison, and the various hashing methods we will 
study will vary based on how they resolve these collisions — i.e. how they actually arrange the 
values so that we can make two insertions both work even if the keys of those records hash to 
the same location. 


1.2 Making hashing work well 


In order to have as few collisions as possible, there are a few things we can do. First and 
foremost, our hash function should be designed to distribute keys evenly across the various cells 
of the table. That is, there should be a 1/tablesize chance that a randomly chosen key will hash 
to a particular index, regardless of the index. If this is true, then out of tablesize insertions, we 
are likely to have one in every cell. Now, of course, the real world does not work this perfectly 
—so we will get collisions — but they will be as infrequent as possible. 

Also, certain mathematical patterns need to be avoided when matching up a table size with 
an appropriate hashing function. In general, hash tables work rather nicely when the table size 
is a prime number. In those cases, using the hash function h(x) = x mod tablesize is generally 
a decent approach. This is what we are going to do. 

There is an entire science behind hash functions. However, what we are going to focus on in 
this class is how to make the structure actually work. That is, you can certainly mess around 
with the mathematics to make collisions less likely, but that doesn’t matter if you don’t know 
how to insert and lookup and delete in the first place. So, our approach will be to start out 
with good but simple hash functions, and discuss the structures in that context. This way, you 
can spend your time trying to learn the details of a hashing class itself, rather than spending 
your time trying to learn the mathematics of advanced hashing functions. Once you understand 
how to use the hash table, you can always go back and improve the hash function. If you are 
interested in the mathematics behind some better hash functions, you can read through section 
8.5 in the text on your own time. 

Most of the remainder of our discussion on hashing will deal with how to handle collision 
resolution — what we do when we have two values and they are both supposed to hash to the 
same cell in the table. 


1.3. Open Hashing 


The simplest form of hashing is known as open hashing. With this form of hashing, collisions are 
resolved by building a linked list off the table cell to hold multiple elements. If three elements all 
hash to 2, then the cell at index 2 holds a pointer to a linked list holding those three elements. 
If five elements all hash to 6, then the cell at index 6 holds a pointer to a linked list holding all 
five of those elements. If there are no elements that hash to 4, then the cell at index 4 holds a 
NULL pointer (or a pointer to an empty linked list, depending on how you implemented your 
hash table). 
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Figure 1.1: An open hash table with hash function h(x) = x mod 7 and elements 17, 8, 4, 13, 
29, 11, and 22. Only keys are shown in the above picture, not the records associated with them. 


Since each table cell has a chain of nodes built off of it, open hashing is also known as separate 
chaining. 

The insertion, lookup, and deletion operations are all rather straightforward with this im- 
plementation. To insert a value, hash on the record’s key to get the appropriate index, and 
then add this value to the linked list at that index. To perform a lookup on a key, hash on that 
key to get the appropriate index, and then search the linked list at that index for the record 
with your key. If you find your key, the search was successful; if you reach the end of the linked 
list without finding your key, then the search was unsuccessful. Finally, to delete a key and its 
associated record, hash on the key to get the appropriate index, and then search the linked list 
at that index for the record with the given key. If you find that record, remove its node from 
the linked list. If you reach the end of the list without finding such a record, then no record 
with that key exists in this table. 

What is the performance of open hashing? Well, clearly, it depends on the size of the linked 
lists. Since we can grow these lists without bound (except, of course, for our actual physical 
memory bound), clearly we can insert more values than there are cells in the table, something 
we won’t be able to do with the next form of hashing we look at. So, first let’s consider the case 
in which we have inserted a number of elements exactly equal to the table size; after that, we 
will consider the case where the number of elements we insert exceeds the table size. 

If the number of elements we insert is exactly equal to the table size, well, due to our hash 
function distribution — which should distribute values well — our expectation is that each of these 
values would end up in a different spot in the hash table. So, that means each of the linked list 
sizes is 1, and so insert, lookup, and remove are all O(1). That is the expected case. In the 
worst case, we have all of the values in one linked list, because all of the values have hashed to 
the same cell. In this case, if we have n elements, then the running time for lookup and removal 
(which requires a lookup in this case) is O(n). Insert can still work in O(1) time if we always 
insert at the front of the list; if we make a sorted insert, or always insert at the end of the list, 
then insert will be O(n) worst-case, just as lookup and removal are. 

If we have many elements in the table — more than there are cells — then even if the n 
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elements are perfectly distributed there will still be n/tablesize elements in each linked list. So, 
it would appear that the lists would all be proportional to n even in the average case, and so the 
running times of the traversal-based operations (lookup, removal, and sorted or at-end insert) 
would be O(n) even in the average case. However, we can avoid this if we place a bound on 
the size of a given linked list. If we know, for example, that no matter how many values we 
insert into the table, the size of a particular linked list never exceeds, say, 3, then we can think 
of all of our operations as being O(3) = O(1). How can we guarantee this? Well, one (bad) 
possibility is to prevent insertions that violate this rule. A better possibility is to rehash the 
table when a list exceeds an acceptable size. We will look at rehashing later, in the context of 
the next hashing strategy we look at, but the ideas will apply to open hashing as well. The basic 
idea is to increase the size of the table and then rehash all the elements (hence the name). By 
increasing the number of available slots in the table, we are likely to reduce the size of many of 
the lists — including the one that has recently grown too long. Again, we will see more details 
on this later on, but the point is that it 7s possible to solve this problem without limiting the 
number of elements in the table — if we want to place limits on the length of the lists, but allow 
the number of elements to grow freely, then the only other thing we can do is alter our remaining 
restriction, the size of the table, as needed. 

In summary, though, our running times for the traversal-based open hashing operations 
depend on the linked list size, and thus are O(1), 1) when the number of total nodes is bounded 
and we are talking about average case performance, and 2) when the number of nodes can 
grow without limit but list sizes have a constant upper bound. On the other hand, these same 
operations are O(n), 1) when the number of total nodes is bounded and we are talking about 
the worst case performance, and 2) when neither the total number of nodes nor the list sizes are 
bounded. 

There is a more complete and accurate analysis of hashing running times in the textbook; 
the above is a decent enough analysis for the purposes of this class. 


1.4 Closed Hashing 


The major problems with open hashing are the extra pointers needed and the cost of dynamically 
inserting and removing nodes. The dynamic cost is especially an issue — with every insert and 
remove it becomes necessary to make a call to the memory mananger. Our next collision 
resolution strategy avoids this cost. 

With closed hashing, the records stored in the table are stored directly in the table, rather 
than in linked lists pointed to by the table cells. This immediately raises the question of what 
we should do about collisions — after all, if we are to put records directly in the table, then we 
definitely have a problem if two keys hash to the same spot. 

If we have a record y, and we want to put it in a spot that is already occupied by another 
record x, our only options (if we are not going to lapse back into a form of open hashing by 
“dividing” the cell into two spaces) are to either: 


1. delete x so that we can put y there 
2. move «x elsewhere in the table, so that we can put y in the spot formerly occupied by «, 


3. place y elsewhere in the table 
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Choice 1 isn’t really an option at all — we’d like to keep our records in the table once we put 
them there. And, out of choices 2 and 3, choice 3 makes more sense — all else being equal (as it 
is here), we should find an alternative for the new record, rather than the old one. 

So, we want to put y in a different place in the table. However, this causes two new problems. 
First, where do we put it? Second, and more importantly, how do we find it again? The entire 
point of hashing was to have constant-time access to records by deriving their table index from 
the key value of the actual record. If we now stick y somewhere else in the table, we will still 
look in the original spot (where z is now) when looking for y, and since we won’t find y there, 
we would then need to search the table for y. Presumably, this is an O(n) operation, and so it 
appears that our constant-time performance is gone if we have even a single collision. 

However, all is not lost! There is a way out of this mess, and that way is to remember where 
we put y!!, Since we possibly are resolving many collisions in this manner, trying to keep some 
kind of a database of where each collided value is located would not be practical. However, we 
can also “remember” these “second-try” locations by having a standard sequence of new places 
to attempt to put y. That is, we decide in advance: any record whose key hashes to index a, 
if spot a is full, we will try a specific second spot (call it b). If that spot is full, we will try a 
specific third spot (call it c). And so on. If, for each cell in the table, we decide in advance on a 
specific order of “alternative” cells to try, then if we try to insert a record, and the original cell 
is full, we can simply try our list of “alternative” cells, one by one, until we find an empty cell. 
Then, when we go to look up the record, we won’t find it in the original cell, but again, we can 
try the “alternative” cells one by one until we do find our value. 

This process, which is what allows closing hashing to work, is known as probing, and the list 
of alternative cell indices is known as a probing sequence. It might seem like creating a probing 
sequence is a lot of work, but actually there are mathematical patterns that allow us to express 
a probing sequence quickly and easily. We can have a function that takes as a parameter the 
number of probes we have made so far (with our initial attempt, to access the original cell, being 
the “zero-th” probe, not the first probe). Each time a cell is full, our probe number increases 
by 1, and a new value is passed into the function. The numbers returned by this function, then, 
represent offsets from the original cell. So, in this manner, we don’t even need a separate list 
of alternative cells for each spot in the table — if our original cell is different, even if the list of 
offsets is the same we will still get different alternative cells. 

In short, we can now handle hashing via the following formula: 


H(x, i) = (h(x) + f(i)) % tablesize 


where h(x) is once again the function that gives us the original cell, and f(z) is our probing 
function. Since we only have a finite-sized table, we use the modulus function to wrap around 
the end of the table, hence the “% tablesize” at the end of the expression (recall that, in C++, 
the % operator signifies the modulus operation). 
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Then, on insertion, we simply try the table cells in the order: 


H(x, 0) 
H(x, 1) 
H(x, 2) 
H(x, 3) 
H(x, 4) 


until we find an empty cell to insert into. And, when we look up that value, we try the same 
cells in the same order, and will eventually find the value we are looking for. 
Because a particular record could be located in various places in the list, and not just at the 
specific spot that its key hashes to, closed hashing can also be referred to as open addressing. 
(Now, you may have a few questions about this method already — keep reading and hopefully 
they will be answered. We have by no means “plugged all the holes” yet...) 


1.5 Probing methods 


1.5.1 Linear probing 


The simplest form of probing is known as linear probing. It is with this method that we are going 
to illustrate the closed hashing method in full — i.e. it is in the context of linear probing that 
we are going to “plug all the holes” referred to above. Then, when we switch probing methods, 
these methods will still apply — only the probing sequence will be different. 

Linear probing is best described with a single equation: f(z) = 7. That is, as the number of 
probes increases by 1, the value of the offset also increases by 1, giving the following formula: 


H(x, i) = (h(x) + i) % tablesize 


and the following probing sequence: 


H(x, 0) = ( h(x) ) % tablesize = ( h(x) + 0 ) % tablesize 
H(x, 1) = ( h(x) + 1) % tablesize 
H(x, 2) = ( h(x) + 2) % tablesize 
H(x, 3) = ( h(x) + 3) % tablesize 
H(x, 4) = ( h(x) + 4) % tablesize 


So, if we have a table of size 7, indexed from 0 to 6, with a hash function h(x) = x mod 7, 
the sequence of insertions of records with the keys will appear as in the following diagrams. 
(NOTE: for all closed hashing examples, we are only going to draw the key inside the table, not 
the actual record, and we are only doing to talk about the key. So, keep in mind that the key is 
part of a record, and that that entire record would appear in the table, not just the key. We’re 
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Figure 1.2: A closed hash table with hash function h(x) = z mod 7 and linear probing. (a) The 
empty table. (b) After inserting 9. (c) After inserting 11. (d) After inserting 23. (e) After 
inserting 3. (f) After inserting 2. 


just making a simplification for the purposes of discussion, since it won’t affect any of what we 
discuss. ) 


We’ve already covered the methods for successful inserts and successful lookups — successful 
insertions simply involve checking the cells in the probe sequence one-by-one until an empty 
space is found, and successful lookups simply involve checking the cells in the probe sequence 
one-by-one until the key we are looking for is found. 


Unsuccessful inserts can only happen here if we move through every element in the table one 
by one and they are all full. The easiest way to handle this is to just compare the number of 
probes, 7, to the tablesize at each step. Once i exceeds the tablesize, the insert is unsuccessful. 
But, what about unsuccessful lookups? Do we need to search through the entire table searching 
for a value that is not there? What if we have a size 7 table with one value, 9, at location 2 
(diagram (b) above)? If we now lookup the value 16, it seems that it would be a hideous waste 
to have to progressively search locations 2, 3, 4, 5, 6, 0, and finally, 1, in order to confirm that 
16 is not in the table. 

And, as it turns out, we don’t need to do that. The reasoning is as follows: if 16 had been 
inserted before 9 was inserted, then 16 would be in location 2 and 9 would be somewhere else. 
Since 9 is in location 2, this is obviously not the case — 9 must have been inserted before 16. 
Now, when 16 was inserted, location 2 was full, so location 3 would have been tried next (since 
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this is linear probing). The only reason we would skip over location 3 would be if location 3 
was full. Note that in diagram (f) above, when we go to insert 2, we skip over locations 2, 3, 4, 
and 5, since they are all full. The same would happen if we were trying to insert 16 — we keep 
probing until we find an empty location. Since location 3 is empty, we know immediately that 
16 is not in the table. If it were, either 16 would be in location 3, or 16 would be elsewhere 
because some other value was in location 3. But, in this example, nothing is location 3, and 
the situation where we skip over an empty location 3 and insert 16 elsewhere in the table is a 
situation that cannot happen. So, we know 16 is not in the table. 

That is, when we do a lookup, we search the cells indicated by the probing sequence, one by 
one, until we either find the value or reach an empty location. If we find the value, the search was 
successful. If we reach an empty location, then due to the way we handle insertions, we know 
that the value we are looking for is not in the table, and therefore our lookup is unsuccessful. 

Note!: if the table is very full, this will still result in searching a great deal of the table. In 
fact, in the worst case, the table is completely full and does not hold our value. In this case we 
will search the entire table before recognizing — again due to 7 exceeding the tablesize — that we 
have looked at every cell. So, in the worst case insertions and lookups are O(n). What this tells 
us is that letting the table fill up is bad. 

There is one more issue to discuss, and that is the issue of deletion. On the surface, deletion 
seems as if it would be rather straightforward — if we want to delete y, find y in the table, and 
then remove it from the table, making that cell empty. However, this creates a serious problem 
with lookups. Remember that our entire lookup strategy was based on the fact that we would 
not “skip over” possible cells. But, what if a cell was emptied after an insertion that went past 
it? Consider, for example, a table of size 7, where we have done three insertions that all hashed 
to the value 2 — 9, 23, and 16. If the insertions are made in that order, then 9 will be entered 
in cell 2, 23 will collide at cell 2 but will be inserted in cell 3, and 16 will collide at cells 2 and 
3 but will be inserted in cell 4. 

Now, delete 23. When we go to lookup 16, our lookup procedure will notice that cell 2 is 
full, and move on to cell 3. Then, it will notice that cell 3 is empty, and will return that the 
search was unsuccessful. But, 16 is in the table! The problem here is that our insertion was 
based on the former status of cell 3, while the lookup is based on the current status of cell 3. If 
those are different due to a deletion, then our previous work all falls apart. This can be seen in 
the diagrams on the next page. 

The solution? Remember the previous status of the cell! That is, there is a difference between 
cells that have always been empty — which the lookup procedure can skip over — and cells that 
are empty now but which used to be full. That latter type of cell, the lookup procedure should 
not skip over, because the value being searched for could have been inserted at a time when this 
now-empty cell was full. 

So, we want an extra piece of information in the cell — a status field that will tell us if the 
cell is: 


e valid — contains a legal value 
e empty — never held a value, or 
e deleted — empty now, but once full 


When the table is first allocated, all cells are marked as being “empty”. As we perform 
insertions, the status of those cells changes to “valid”. And, if we perform a deletion, the status 
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Figure 1.3: The problem of deletion. (a) A closed hash table with hash function h(x) = x mod 7 
and elements 9, 23, and 16, inserted in that order. (b) 23 is completely deleted from the table. 
(c) Now, 16 cannot be found because our lookup procedure stops at the first empty spot it finds. 


of the cell changes from “valid” to “deleted”. 

(NOTE: we would actually appear to need this status field anyway. After all, when reading 
a cell, how do we know if the value in that cell is meaningful or if it is garbage? It would seem 
that we at least need a status marker to tell us if the cell is empty or valid, even if we are not 
worried about deletion. However, if there is some way to tell if the cell is empty or not — if 
perhaps there is a particular value the cell holds when it is empty that it can’t hold when it is 
full (for example, if the cell holds only positive integers and thus holds -1 to signify “empty” ) 
— then we would only need a status field to tell us if the cell was 1) empty and not deleted, or 
2) empty and deleted, since it would be clear whether or not the cell was empty or full, and the 
value of the status field would be ignored when the cell was full. In this case, we can actually 
use only a single bit to serve as this status field, since we have only two choices, and so this 
status field is sometimes called a deleted bit.) 

Can we still insert into to a “deleted” cell? Certainly. There is no use skipping over a cell 
simply because it used to hold a legal value but doesn’t anymore. In this case, we would change 
the status of the cell from “deleted” back to “valid”. 

But, for lookups, the “deleted” cell is treated the same way as a “valid” cell — the lookup 
procedure will move on to check the next cell in the probing sequence. Only for “empty” cells 
will the lookup procedure stop and return that the search was unsuccessful. 


4 


So, in a sense, you can think of the “deleted” status as follows: 
e deleted is equivalent to empty if you are doing an insert 
e deleted is equivalent to valid if you are doing a lookup 


This method of deletion — where we don’t actually shift anything around in the structure, 
but simply make a notation that this cell no longer counts — is known as “lazy deletion”. You can 
use this idea in other data structures too — linked lists, for example, could simply have a node 
marked “ignore this” instead of having the node removed — but in those situations, you truly 
are being lazy, and you should be prepared to justify why you think lazy deletion is acceptable. 
With closed hashing, the costs of actual deletion and readjustment are too great, and so we are 
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willing to settle for lazy deletion as a reasonable alternative that still provides good performance. 
The following diagrams illustrate this solution. 
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Figure 1.4: The deletion problem solved. (a) A closed hash table with hash function h(x) = 
x mod 7 and all cells initialized to empty. (b) The values 9, 23, and 16 inserted into the table. 
Note that their status fields change to valid. (c) 23 is deleted, and its status field changes to 
deleted. (d) Now, a lookup of 16 skips over the cell formerly containing 23, because the status 
field of deleted signifies to treat this cell as valid for the purposes of lookup. (e) If you insert 
a value that wants to use this cell, simply change the status field back to valid. 


That, then, covers all the hash table routines — you can now insert successfully into a table, 
you know how to recogznize unsuccessful inserts (inserts into full tables), you can perform 
successful and unsuccessful lookups in hash tables, and you know how to handle removals in a 
manner that does not prevent the other routines from working correctly. The only remaining 
questions are: 


1. How good is linear probing? Are there any problems with it? 


2. What other types of probings are possible? Are they better than linear probing? What 
problems do these methods have? 


3. How can we handle situations where the table is full, or where we don’t expect many more 
inserts but have so many deleted cells that the performance of our lookups has degraded 
to a worse point than it needs to be at? 


Well, first things first. 
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1.5.2. The problem with linear probing 


The problem with linear probing is that as the table starts to fill up, you get larger and larger 
clumps of values. Because linear probing moves down the table value by value, as soon as an 
insert has a collision, you basically traverse down cell by cell until you reach an empty spot. 
Now, if a new value collides with either the initial value, the ending value, or any of the values 
in-between, it will end up at the bottom of the clump and add to the size of the clump, making 
it a larger target. These large clumps degrade performance to the point where we get worst case 
instead of average case time. 

This problem is known as the primary clustering problem — as clumps grow, they become 
bigger and bigger targets, and thus are more likely to grow some more. 
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Figure 1.5: Primary clustering. (a) A hash table with a cluster of values near the top. (b) 
Attempted insertion anywhere in this cluster (the top, in this case) results in probing all the 
way down to the bottom of the cluster...and also results in an addition to the cluster. (c) Two 
clusters, with some empty cells in between (indicated by the black square). Three attempted 
insertions into the top cluster will fill these cells as in diagram (b), and then the two clusters 
will be merged into one even larger cluster. 


We can get better performance if the probing sequences don’t overlap to such a large degree. 
If the alternate cells we check for cell 2 are largely different than the alternate cells we check 
for cells 3 and 4, then we are unlikely to experience primary clustering as we insert values that 
initially probe to cells 2, 3, and 4. 


1.5.3. Quadratic Probing 


One way to solve the problem of primary clustering is to use a different type of probing, known 
as quadratic probing. Quadratic probing is not much different than linear probing. The insert, 
lookup, and deletion procedures still work the same way. There is one major difference, though, 
and that is the probing function itself. For quadratic probing, f(i) = 7?. 

This gives us the formula: 


H(x, i) = (h(x) + i°2) % tablesize 
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and the following probing sequence: 


H(x, 0) = ( h(x) ) % tablesize = ( h(x) + 0) % tablesize 
H(x, 1) = ( h(x) + 1) % tablesize 
H(x, 2) = ( h(x) + 4) % tablesize 
H(x, 3) = ( h(x) + 9) % tablesize 
H(x, 4) = ( h(x) + 16) % tablesize 
H(x, 5) = ( h(x) + 25 ) % tablesize 


So, for example, if we have h(x) = x mod 7, then a value that hashes to 2 would follow the 
probing sequence 2, 3, 6, 11, 18, 27, 38, and so on. Whereas, a value that hashes to 3 would 
follow the probing sequence 3, 4, 7, 12, 19, 28, 39, and so on. Note the difference! With linear 
probing, the two sequences would have been 2, 3, 4, 5, 6, 7, 8,... and 3, 4, 5, 6, 7, 8, 9.... 
meaning there is a lot of overlap between the sequences which results in the primary clustering 
effect. With quadratic probing, there is very little overlap between the sequences (and that only 
because 2 and 3 are right next to each other; 2 and 4 would have no overlap at all). Therefore, 
primary clustering is basically eliminated. 
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Figure 1.6: Quadratic probing. (a) Our now-usual hash table. We will insert many values that 
all hash to 2. (b) Inserting 9. (c) Inserting 23. (d) Inserting 16. Note that after trying an 
offset of 1, you try an offset of 4, not 2. (e) Inserting 2. Note that after trying offsets of 1 and 
4, an offset of 9 is tried. This wraps around the table: 2+ 9 = 11 = 4(mod7). 
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Note that in the diagrams of figure 1.6, if the table had been larger, the value 2 would have 
gone into an actual cell 11, the next value would be inserted into cell 2 + 16 = 18, and so on. 
This would leave plenty of space in between these values for the values that hash to cells such 
as 4, 7, 10, and 14. We’ve eliminated primary clustering! 

However, there is still the problem of values hashing to the same point following the same 
probing sequence. For example, any value hashing to 2 will follow the 2, 3, 6, 11... probing 
sequence listed above. This means that the fourth value inserted that hashes to 2 will automat- 
ically collide three times before we are able to insert it corrrectly (assuming we have not done a 
deletion yet). Each additional value that hashes to an index will collide with all the values we 
have already inserted that initially hashed to that index. This problem is known as secondary 
clustering — the values that hash to different cells no longer cluster together, but the values that 
hash to the same cell still do cluster together. If possible, we would like to eliminate that effect 
as well (and we will in the next section). 

In addition, there is an interesting problem that comes up with quadratic probing: take the 
hash table of diagram 1.6e, and try to insert the value 30. You will find that 30 cannot be 
inserted in the table — your probing sequence, wrapping around the table due to the modulus 
operator, will continually refer you to cells that are already full. It turns out that quadratic 
probing cannot guarantee successful insertion if the the table is half full or more than half full!! 

The solution to this is to rehash once the table is more than half full. We will look at 
rehashing at the end of these notes. 


1.5.4 Double Hashing 


To solve the secondary clustering problem, it will be necessary to have different probing sequences 
not just for different cells, but for different values that hash to the same cell. For example, as 
long as we have the same probing sequence for any value that hashes to 2 — whether it is 9, 23, 
30, or 16 — then if we insert 9, 23, 30 and 16, they will form a secondary cluster because they 
all have the same probing sequence. However, if somehow our probing sequence could depend 
on the specific value being inserted — that is, if the probing sequences for 9, 23, 30, and 16 were 
different even though they all hashed to 2 — then we would eliminate not only primary clustering, 
but secondary clustering as well. 

Double hashing is a probing method that does exactly what the name implies — namely, it 
hashes a second time. The basic idea is the following: f(i) = i* he(x). That is, our probing 
function offset is a multiple of the result of some second hashing function. The value returned 
by this second hashing function doesn’t change as we do more probes — the value is determined 
by xz, not 7. All that happens as we perform a second, third, fourth, etc. probe is that our 
function f(z) generates increasing multiples of this value. 

For example, let h(x) = x mod 7 and ho(x) = 5 — x mod 5. If you are inserting 9, the value 
will of course hash to 2. In this case, there was no collision, so we don’t need the probing offset. 
When we go to insert 23, that also hashes to 2, but since that cell is full, we must make our first 
probe into the table. h2(23) = 2, and so our first probe occurs at the cell 2+ 1%*2= 4. If that 
cell was also filled, then our next probe would occur at the cell 2+ 2 * 2 =6. The next probe 
would be (2+3%*2) mod 7 = 1, the next at (2+ 4 * 2) mod 7 = 3, and so on. 

But, these values are only for an attempted insert of 23! If instead of 23, we were trying to 
insert 16, then our probing sequence would be different. h2(16) = 4, so once we realize cell 2 
is full, the next cell we try would be 2+1%*4= 6. If that cell was full, we would try the cell 
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(2+ 24) mod 7 = 3. If that cell was full, we would try the cell (2+3 4) mod 7 = 0, and then 
the cell (2+ 4 * 4) mod 7 = 4, and so on. 
So, our series of probes for a given « would be as follows: 


H(x, 0) = ( h(x) ) % tablesize = ( h(x) + O*h_2(x) ) % tablesize 
H(x, 1) = ( h(x) + 1*h_2(x) ) % tablesize 
H(x, 2) = ( h(x) + 2*h_2(x) ) % tablesize 
H(x, 3) = ( h(x) + 3*h_2(x) ) % tablesize 
H(x, 4) = ( h(x) + 4*h_2(x) ) % tablesize 
H(x, 5) = ( h(x) + 5*h_2(x) ) % tablesize 


However, the key point here is that even the probing function varies with x!. So, for a 
different x with the same h(x) value, you still likely have a different ho(x), and therefore you 
have a different probing sequence — the same pattern, listed above, but a different actual probing 
sequence because even if the h(x) values are the same for the two keys, the ho(x) values are 
likely to be different! 

That is, not only do we not get primary clustering, but we don’t get secondary clustering 
either. 


Example: h(x) = x mod 7, h_2(x) = 5 - x mod 5 
probing sequence for 2: 
23 55 1 Ay ges 
probing sequence for 9, which also hashes to 2: 
23.3574 Boe: 
probing sequence for 16, which also hashes to 2: 
25. 65,3 Ope2 4 
probing sequence for 23, which also hashes to 2: 
2 45. -6 5. Ayes 
All the above values hash to 2, but each has a different second value! 
So, you no longer have the problem of values that hash to the same spot following the same 
probing sequence. Not only is the probing sequence significantly different for each starting value, 
it is also significantly different for many different values that hash to the same value. You can 


still have multiple collisions on a single insertion, but it is less likely to happen for double hashing 
than it is for linear or quadratic probing. 


1.6. REHASHING 17 















































23 




















Dnnk WN YF SO 

Dnnk WN YF SO 
\o 

Dnnk WN YF CO 
\o 

























































































(a) (b) (c) 
0 0 
1 1 
2\9 2| 9 
3 3 
4] 23 *| 23 
5 Bi 
6| 16 6] 16 
(d) (e) 


Figure 1.7: Double hashing. (a) Our usual hash table, with h(x) = z mod7. This time, we 
have a second hash function, he(x) = 5 — x mod 5. (b) We insert 9 with no collision. (c) 23 is 
inserted, collides once, and is then placed in cell 4. (d) 16 collides once before being placed in 
cell 6. (e) 2 collides once before being placed in cell 5. It is important to notice that, unlike 
with the previous methods of probing, in this example each value had only one collision! 


1.5.5 Closed hashing performance 


If we manage to keep our table at least somewhat empty (usually we don’t want our table to 
be any more than half full, or performance starts to seriously degrade), and our hash function 
gives us the good random distribution we expect, then we get our average case performance, 
which is O(1) for insert, lookup, and remove. If, on the other hand, we have a very full table, 
or a large cluster, or both, we are likely to have to search many values to find the particular cell 
we are after. In the worst case, we can get O(n) performance. As with open hashing, there is 
a more accurate analysis you can explore, but the above will be good enough for this class. In 
the more accurate analysis, you can show that double hashing will provide better performance 
than linear or quadratic probing will provide you. 


1.6 Rehashing 


We have one more question left to answer: what happens when the table becomes completely 
filled up? Or, since we want to keep the table somewhat unfilled to guarantee good performance, 
what happens when the number of cells that are filled exceeds a certain percentage of the total 
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number of cells available? The answer here is to perform a rehashing. 

Rehashing is actually quite simple to describe: create a new, larger table, take each of the 
values in the old table, and insert them into the new table. The new table then becomes the 
hash table you care about (and the old one can be deleted). Since you are hashing these values 
again — into a new table — we therefore get the name “rehashing”. 

The exact details of this method, though, deserve a bit of discussion. The major question 
is: how big do we make the new table? It would seem that if we make the table only one cell 
larger, we are going to have to rehash again very shortly. So, it would make sense to make the 
table significantly larger. However, we don’t want to make it so large that we are wasting large 
amounts of memory — instead, it would be better to grow the table to a size just large enough 
that we don’t need to do another rehashing very soon, but not so large as to be unreasonable. 
Otherwise, why not just allocate a table of size 1 trillion table from the start? We want the 
table size to grow, but to grow gracefully. 

In practice, doubling the size of the table works nicely, for reasons we shall explain shortly. 
However, we cannot exactly double it, because we want our table size to remain prime. So, what 
we are going to do is double the table size, and then increase that calculated size upward to the 
next largest prime number we encounter. For example, a table of size 7 would first increase to 
14, and then settle at 17. A table of size 11 would first increase to 22, and then settle at 23. 
And so on. 
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Figure 1.8: Rehashing. (a) With our fourth insertion into our quadratic probing table from 
figure 1.6, we have a table that is over 50% full. So, we will rehash this table. (b) The table size 
should be the smallest prime number that is at least double the old table size. (c) The values 
from the old table, rehashed into the new table using the hash function h(x) = x mod 17. 


Let’s define load factor to be the number of elements in a table divided by the number of 
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cells. It is basically a “percent full” mark — a load factor of .50 means the table is 50 percent 
full. Lets rehash the table when the load factor exceeds .50. In this situation, if the table size 
is 2n, we have n values we need to rehash. Assuming we get expected case performance, and 
each hash into the new table takes constant time, we are still left with O(n) time for the rehash. 
However, we now have a table only a quarter full, and will need to insert another n elements 
before a rehash is needed. So, you can think of the rehash time as being divided among those 
subsequent n insertions, and as a result the cost of rehashing doesn’t stick out as much as you 
might expect. Over the long term, it will simply appear as if each of our insertions just got a 
bit slower (but still ran in O(1) time in the average case). 


